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ABSTRACT 



We describe a hybrid Fourier/direct space convolution algorithm for compact radial (azimuthally symmetric) kernels 
on the sphere. For high resolution maps covering a large fraction of the sky, our implementation takes advantage of 
the inexpensive massive parallelism afforded by consumer graphics processing units (CPUs). Its applications include 
modeling of instrumental beam shapes in terms of compact kernels, computation of fine-scale wavelet transformations, 
and optimal filtering for the detection of point sources. Our algorithm works for any pixelization where pixels are 
grouped into isolatitude rings. Even for kernels that are not bandwidth-limited, ringing features are completely absent 
on an ECP grid. We demonstrate that they can be highly suppressed on the popular HEALPix pixelization, for which 
we develop a freely available implementation of the algorithm. As an example application, we show that running on a 
high-end consumer graphics card our method speeds up beam convolution for simulations of a characteristic Planck high 
frequency instrument channel by two orders of magnitude compared to the commonly used HEALPix implementation 
on one CPU core, while typically maintaining a fractional RMS accuracy of about 1 part in 10 5 . 

Key words. Methods: data analysis - Methods: numerical - Techniques: image processing - cosmic background radiation 



1. Motivation and goals 

Convolving with radial (i.e. azimuthally symmetric) kernels 
is a key step in some of the most frequently used algorithms 
during the simulation and analysis of cosmological data sets 
represented on the celestial sphere, such as maps of the 
cosmic microwave background (CMB). 

All current and future CMB experiments have 
many (100-10 4 ) d etectors (e.g., the Atacama 



Cosmology Telescope, iKosows kvl 1200.1 the South Pole 



Telescop e. iRuhl et al 
mission 



2004, 



the proposed CMBPol 



, Bauma nn et al.l 120091 or the Planck satellite, 
iPlanck Collaboration et al.l 1201 1[ ). Simulating the signal 



in these data sets requires very many beam smoothing 
operations since each detector map will contain the same 
CMB map smoothed with a separate beam shape. The 
same is true for map-making methods that compute the 
optimal combination of a large number of detectors in 
an iterative process and therefore also requir e a huge 
number of beam smoothing operations (e .g., iTegmarkl 
fl^9^lNatoli"et^[2^rjltlStompor et al.ll2002D . 

Several C MB analysis techn i ques, such as a wavelet 



analy s is (e.g., Hobson et alJll99& iMartmez-Gonzalez et al.l 
120021: IVielva et al.l l2004j). and the filtering to detect 
point sources (e.g ., iTegmark k, de Oliveira-Costal Il998t 
ICavon et all 120001 iGonzalez-Nuevo et al.l 120061) require 
smoothing of high resolution maps with symmetric kernels 
that have (or are well-approximated as having) compact 
support on the sphere. In a wavelet analysis, the compu- 
tational time for the wavelet transform is dominated by 



the computation of the fine-scale wavelet coefficients. By 
construction, the fine-scale wavelets are compact in pixel 
space. 

Current practice in CMB data analysis is the near- 
exclusive use of the fast spherical harmonic trans- 
form (FSHT) for c onvolution with radial kernels 
()Muciaccia et al.l 119971) . Mature and highly efficient 
implementations of this algorithm are publicl y available in 
sever al packages, s uch as, e.g., HEALPbPl (jGorski et al.l 
[2005h . G LESFFI (iDoroshkev et all [2005^ 7 ccSfflfl or 
libpshtQ (|Reineckdl201ll) . 

In the vast majority of cases that a spherical transform 
is calculated during a CMB data analysis, it is to compute 
a convolution with a radial kernel. Examples are: (1) sim- 
ulating CMB maps, in which case the radial kernel is the 
"square root" of the CMB power spectrum; (2) simulat- 
ing observed detector maps with a symmetric beam profile; 
(3) filtering to extract point sources, or hot and cold spots 
on certain scales; and (4) all forms of symmetric wavelet 
analysis. 

While generally correct, this approach is not optimal 
when convolving high resolution maps with sufficiently 
compact kernels. We show that significant speed-up is possi- 
ble with an algorithm that makes nearly optimal use of mas- 
sively parallel fast Fourier transforms (FFT) on consumer 
graphical processing units using a hybrid direct-space and 
Fourier space approach. 
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The plan of this paper is as follows. We begin with a 
mathematical definition of the problem and the quantities 
involved (Section ^ . We then briefly review the convolu- 
tion approaches using a direct sum and the fast spherical 
harmonic transform in Section EH while discussin g a GPU 
implementation of the FSHT (jHupca et al.ll2010D . We in- 
troduce our algorithm and its implementation on a con- 
sumer GPU in Section 2J which also contains benchmark 
results and tests of the numerical accuracy of the algorithm. 
Finally, we summarize our findings in Section [SJ 

The benchmarks of the algorithms reported in this pa- 
per were performed on an Intel Core2 Quad CPU with 
2.8 GHz and 8 GB of random access memory (RAM). The 
system cost (other than the GPU) was about US$ 1000. As 
a reference, we use the popular HEALPix Fortran package 
version 2.15 and the highly optimized libpsht C++ FSHT 
library. We note that starting with the latest release, ver- 
sion 2.20, the libpsht routines will also be called by default 
in the HEALPix package. Our GPU code was timed on a 
NVIDIA GeForce GTX 480 that we bought for US$ 500. 

2. Definitions and notation 

It is useful to state the "platonic ideal" of what is to be 
accomplished. Given a rough map r, we would like to cal- 
culate the smooth map s 

s(hx)= / K(h 1 ,h 2 )r(h2)d 2 h2, (1) 
Js 2 

where n denotes a unit vector on the sphere. For a symmet- 
ric (or radial) kernel, K(hi,fi2) = K{ni.fi2)- We introduce 
the short-hand notation p.q = n p .h q = cos (<(n p , h q )), so 
K(h x ,n 2 ) = K(l.2). 

A band-limited function on the sphere can be defined 
in spherical harmonic space by specifying a set of spherical 
harmonic coefficients ai m for all I from zero up to band- 
limit ^ max . Unless otherwise stated sums are over all non- 
zero terms. With the Legendre transform convention 

K e = 2n J K(z)P t {z)dz, (2) 

the kernel can be expanded in terms of Legendre polyno- 
mials as 

We assume that the kernel has the same band-limit as the 
input map. Recalling the addition theorem for spherical 
harmonics Y lmp = Y lm (h p ) 

E Y ^vYL q = ^- p z M . ( 4 ) 

m 

we obtain 

Si m = Kt r em . (5) 

This equation is exact if the r£ m are known. In many cases 
of interest, however, the map will be available in a sampled 
or pixelized representation with a number of pixels n p i x . 
In this case, estimating the ri m from the sampled repre- 
sentation may introduce a quadrature error. We keep this 
in mind when discussing the convolution accuracy in the 
following. 



3. Methods 

3.1. Direct sum 

The direct sum follows from the straightforward discretiza- 
tion of Equation (TTJ) 

Sp = K (P-l) r q • (6) 

1 

It is easy to check that this approach will yield the same 
output map as Equation (|5|) if the rg m are calculated by 
direct sum over the same equal-area pixelization and both 
map and radial kernel are band-limited functions. In gen- 
eral, this method scales as 0(n^ ~ ^^ ax ), both in terms of 
memory accesses and in terms of floating point operations 
(FLOP). The prefactor can be made small by caching K(z), 
e.g. by interpolating it in 0(1) operations from 0(£ max ) 
precomputed values. This also reduces the number of ac- 
cesses to non-cached memory to 0(^max)- 

If the kernel is compact such that K(p.q) = Vp.q < zk, 
i.e. for an angle between p and q larger than a threshold 
9k, the operation count reduces by a factor zk/2. For suf- 
ficiently compact kernels this method would therefore win 
over other methods with smaller asymptotic time complex- 
ity but larger prefactors. 

The direct sum has a great degree of parallelism, at 
least at the level of 0(n p i x ~ threads, since each out- 

put pixel is the result of a dot product of one row of K with 
r. Theoretically, even more parallelization can be achieved 
by parallelizing the dot product, though care must be taken 
to avoid race conditions when accumulating the smoothed 
map in parallel. In practice, care must be taken to keep 
the number of non-cached memory accesses low since the 
computation would otherwise be limited by memory band- 
width. Since the number of memory accesses is of the same 
order as the number of calculations, the potential for GPU 
implementation is limited. Direct pixel space convolution 
will therefore be superior only for kernels that are too small 
(of widths narrower than a very small number of pixels) to 
be of broad practical interest. 

3.2. Fast spherical harmonic transform 

Pixelizations consisting of uniformly sampled isolatitudc 
rings allow for a fast spherical harmonic transform 
(FSHT), with overall scaling 0(^f nax ) to take advantage 
of Equation ([5]). 

In detail, FFTs on the 0(ng ~ l max ) isolatitude 
rings (each containing 0(n^ ~ f miU[ ) pixels) are done in 
0(^max log^max) operations. The resulting Fourier compo- 
nents b m (9) can be transformed into spherical harmonic 
coefficients rp m by applying an associated Legendre trans- 
form taking 0(^ax) operations per ring, for a total of 
0(ne^max ~ ^max) operations. One obtains the smoothed 
map by multiplying the rg m with the kernel coefficients 
Ki and applying the inverse spherical harmonic transform 
that inverts the above steps in reverse order again taking 
0(^max) operations. 

The Legendre transforms therefore dominate the scal- 
ing since applying Equation ([5]) takes only 0(^„ lax ) time. 
Furthermore, at high t, the recursions necessary to com- 
pute the associated Legendre functions become increas- 
ingly less accurate and need to be done in double preci- 
sion with frequent rescaling to avoid floating point under- 
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flows (jGorski et al.ll2005lh Implementing the FSHT algo- 
rithm on consumer CPUs with reduced double precision 
performance is therefore non-trivial. The inverse spheri- 
cal harmonic transform was implemented and benchmarked 
citing O(10) spe ed gains with respe ct to the S2HAT0 CPU 
implementation ()Hupca et al.ll20l6T) . 

The FSHT algorithm is popular since it reduces the 
computational scaling by a factor ^ max compared to the 
direct sum and yet yields the same result (for infinite pre- 
cision arithmetic) . Approximating the continuous spherical 
harmonic transform by a finite, discrete sum over pixels 
introduces the same error as approximating the kernel in- 
tegral by such a discrete sum. If a quadrature rule is applied 
to improve the accuracy of the ri m , the same quadrature 
weights can be used in Equation ^ to reach the identically 
improved results. 

Utilizing the popular HEALPix library on a modern 
CPU, the serial time for a pair of forward and inverse trans- 
forms at Planck resolution (n si( i c — 2048, £ max — 4096) is 
t w 460 s. Since using FSHTs is by far the most common 
algorithm of choice for symmetric kernel convolutions, this 
is the reference time for comparisons with other algorithmic 
approaches. 

Methods implementing divide-and-conquer schemes for 
fast transforms on non-Abelian groups have a smaller 
asymptotic s caling of CPU time wi t h problem size tha n 
the FSHT (|Driscoll k Healvl [l99i iWiaux et all l200l . 
However, these more sophisticated methods require large 
amounts of RAM to store precomputed information that 
renders them impractical for problem sizes of interest for 
CMB maps, with tens of millions of pixels e.g. from Planck. 
For smaller problem sizes, the comparatively large complex- 
ity of the algorithm causes actual implementations to be 
slower than algorithmically simpler approaches. 

4. Hybrid method 

We now outline a straightforward hybrid method that com- 
bines aspects of the direct summation and spherical har- 
monic transformation approach. It is based on the simple 
idea of convolving along isolatitude rings via computation- 
ally inexpensive FFTs by means of the convolution theo- 
rem, and integrating in the longitudinal direction in pixel 
space. 

This hybrid method is redundant in a way that the prod- 
uct of the kernel image and the input map must be evalu- 
ated once on every ring prior to the summation. The com- 
putational costs amount to O(£ max \og£ max ) operations for 
each FFT on a ring, which must be repeated 0{ng ~ £ max ) 
times for each of the 0(ng ~ I max) rings. In total, the al- 
gorithm requires C(£^ ax log^ max ) operations, formally in- 
ferior to the conventional FSHT approach. Unlike that 
method, however, if the convolution kernel has finite sup- 
port on only n S upport < n>6 rings, the computational com- 
plexity decreases linearly, £>(n support ^ ax log£ max ). It is 
dominated by FFTs for which highly optimized implemen- 
tations with a small prefactor exist. Furthermore, the algo- 
rithm intrinsically offers an extreme amount of data paral- 
lelism, making it in particular suitable for an implementa- 
tion on GPUs with hundreds of cores. 

In practice, the algorithm field-of-application is limited 
to compact kernels. If a kernel is formally non-zero across 

5 http : //www. ape .univ-paris7 . f r/~radek/s2hat .html 



the entire sphere, but vanishes sufficiently fast beyond a 
given angular distance a cu t, it can be truncated at that 
radius without introducing significant errors. For the con- 
volution of an isotropic map with power spectrum Ci, the 
mean quadratic error introduced by this approximation can 
be estimated to be 

a 2 = Y^(2£+l)AK?C e , (7) 

where AKg is the Legendre expansion of the difference of 
the exact and the truncated kernel. 

4.1. Overview of the algorithm 

We now describe the GPU implementation of the convo- 
lution algorithm for an input map in HEALPix format in 
greater detail. We visualize the individual steps of the al- 
gorithm in Figure [TJ 

HEALPix maps with resolution parameter n S id are di- 
vided into three regions, the north polar cap, the equatorial 
region, and the south polar cap. Each of the two caps con- 
sist of n caps = n s id e — 1 rings, where the nth ring (counted 
from the pole) contains 4 n pixels. The equatorial region 
comprises n cqu = 2 ?i S idc + 1 rings with a fixed number of 
4 n S jdo pixels per ring. 

We perform a real-to-complex Fourier transform of 
length tifft on each ring, where nppT = 4n s ide in the equa- 
torial region but only tifft = 4 n in the polar caps. We then 
zero-pad the Fourier coefficients around the poles to gener- 
ate a rectangular array of 4 n s id e — 1 sets of 2 n s ;d e + 1 Fourier 
coefficients each. The rectangular shape of this array allows 
us to use the batch FFT mode of our FFT library for the 
Fourier convolutions that gives us a significant time saving 
since FFTs dominate our computational time budget. As 
every ring in the polar and every other ring in the equato- 
rial region of a HEALPix map is s hifted by <ftp = 7r/ (4n) 
and (j)Q = 7r/(4?i si dc), respectively ()G6rski et al.l 12003 ). we 
compensate for this distortion by phase-shifting the rath 
Fourier coefficient 

b' m (9) = b m (9)e im ^. (8) 

After preparing the input map in this way, we can start 
the convolution process, which loops over all rings in the 
output map. For each ring at latitude 8q, we generate a 
kernel grid of size n support x 4 n s id pixels and place the ker- 
nel at its center at (9,<fr) — (0o>O). To finally interpolate 
the kernel on this grid, we first calculate the angular dis- 
tance a between a pixel located at (0, 4>) and the kernel. A 
Taylor expansion to second order in a allows us to rewrite 
the equation in a simplified way 

a 2 a 2 [sin 2 (l/2(0 + O )) + sin 2 (l/2(0 - 9 ))] 

-2 [sin 2 (1/2(6* + O )) - sin 2 (1/2(0 - O ))] cos(</>) . 

(9) 

Using this parametrization of the angular distance has 
the advantage of making a numerically expensive arcco- 
sine operation redundant. Though Equation © is for- 
mally only applicable in the small angle regime, we can 
exactly compensate for the error by systematically bias- 
ing the kernel evaluation. To this end, we first define 
e(a) = cos(a) — (1 — 1/2 a 2 ) as the error introduced by 
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the approximation. Instead of precomputing a table of the 
form [a 2 , if (a)], we then store a datastructure containing 
the values a 2 , K(a + e(a))~\ . To evaluate the kernel, we 
interpolate linearly from the table. 

After interpolating the kernel on the grid, we perform an 
efficient radix-2 batch FFT. The kernel Fourier coefficients 
are then multiplied by those of the input map. As a next 
step, for each ring pixel, we calculate the summation over 
^support elements in the 0-direction. Finally, we store the 
output-ring Fourier coefficients and continue to the next 
ring. 

To transform back to a HEALPix map, we reverse 
phase-shift the rings and perform a complex-to-real back- 
ward FFT on all output rings. In polar rings, we truncate 
the Fourier expansion to the Nyquist frequency of each in- 
dividual ring. An alternative procedure would have been 
to explicitly alias the super-Nyquist modes to sub-Nyquist 
modes for each ring. Our tests show that this produces neg- 
ligible differences for maps with a band- limit £ ma x < 2 n s id c - 

Our approach avoids the recursions for the associated 
Legendre functions. Therefore, the entire algorithm could 
in principle be computed in single precision, which renders 
it particularly suitable for inexpensive consumer GPUs that 
have limited double precision capability. However, for very 
narrow or highly varying kernel in particular, calculating 
the angular distance between a given pixel and the ker- 
nel center, Equation ©, becomes imprecise. We therefore 
compute this quantity using double precision arithmetic for 
the entire calculation. Further performance improvements 
at the 10 % level can be realized by partially relaxing this 
requirement for sufficiently smooth kernels or less stringent 
accuracy goals. 

4.2. Implementation and benchmarks 

The algorithm was implemented in C++ using the NVIDIA 
CUDA 3.2 programming toolkit to generate GPU-related 
code sequences. We also implemented the hybrid algorithm 
for the CPU mostly for validation purposes. To obtain a 
program that runs efficiently on GPUs, a basic understand- 
ing of the hardware properties is necessary. We therefore 
discuss some of the relevant aspects in the following. 

GPUs are streaming devices designed for highly parallel 
high throughput data processing. The hardware used here, 
a NVIDIA GeForce GTX 480, is a consumer GPU featuring 
15 multiprocessors with a total of 480 shader processors. It 
is equipped with 1.5 GB of memory and has a nominal 
peak performance of 1.3 TFLOP/s in single precision. The 
value for double precision arithmetic would be half as large 
but has been intentionally degraded by an additional factor 
of four by the vendor. For comparison, we note that the 
performance of the quad-core CPU used for our benchmark 
tests is about 45 GFLOP/s. 

The latency of main memory accesses with a theoretical 
bandwidth of 177 GB/s is large: we typically expect a delay 
of several hundred clock cycles from the fetch command to 
the point where the requested datum is actually available. 
In the latest generation of NVIDIA GPUs, a LI cache of 
up to 48 KB, dedicated to a specific multiprocessor, and a 
global L2 cache of 768 KB may reduce the memory latency. 
Besides main memory, the GPU offers up to 48 KB of low 
latency shared and 64 KB of cached constant memory. In 



addition, at most 63 registers with virtually instantaneous 
access and the highest data throughput rate are available 
per thread. 

In general, there are two means of hiding memory la- 
tencies: thread-level parallelism and instruction-level par- 
allelism. On GPUs, threads are very lightweight and the 
switching between them is fast. Common practice is there- 
fore to divide the work load over considerably more threads 
than physically available computing cores. The second 
strategy is less obvious. It attempts to calculate several 
unconditional outputs within the same thread. If a thread 
encounters a cache miss while computing result A, it can 
partially hide the latency by continuing to work on the 
independent result B. Our tests show that reducing the to- 
tal number of active threads by exploiting instruction-level 
parallelism enhances code performance by several tens of 
percent. 

Computations on the GPU are triggered by launching 
kernels (not to be confused with "convolution kernels"). 
As a parameter to such a function call, we have to specify 
how the work load should be processed. More precisely, we 
define a grid consisting of ID or 2D blocks that are con- 
secutively assigned to a GPU multiprocessor. Each block 
contains information about the number of threads to be 
executed on the device. 

Our hybrid algorithm can be implemented in a form 
that is particularly well suited to calculations on GPUs. 
After transferring the regridded input map to device mem- 
ory, we use the CUFFT library of toolkit version 3.1 to 
compute a ringwise out-of-place real-to-complex FFT on 
the input map. In contrast to claims in the vendor's release 
notes accompanying the latest version 3.2, we found the 
predecessor of the algorithm to be more efficient in terms 
of both computational performance and memory consump- 
tion, though it proves less accurate for certain transforma- 
tion lengths. 

To interpolate the convolution kernel on the grid, we 
launch a GPU kernel comprising a two-dimensional grid of 
^side/512 x n sup port blocks with 128 threads. Each thread 
computes the output for 8 different pixels within a ring, 
with a stride given by the block width. For an efficient 
interpolation, we stored the longitudes of all rings and the 
interpolated convolution kernel in constant memory. This 
code section is compute bound and can be accelerated by 
taking advantage of the grid symmetries. We calculate the 
kernel only on the first 2 n s idc + 1 pixels of a ring and only 
for the northern hemisphere. 

After the function call has been completed, we execute 
a real-to-complex batch FFT on the kernel map. 

The reduction along the ^-direction is computed par- 
tially using thread-save atomic add operations, available 
on GPUs of compute capability 2.0 and higher. We launch 
a GPU kernel with a two-dimensional grid of [(2 n S idc + 
l)/64] x [^support/32] blocks with 64 x 2 threads. Each 
thread accumulates the product of input map and kernel 
on up to 32 rings in a local variable and adds the result via 
an atomic operation to the global output map. A block 
performs this calculation on the northern and southern 
hemisphere simultaneously. Although a sophisticated de- 
sign pattern for the common reduction problem exists, we 
found this approach to be more efficient because we have 
to deal with non-contiguous memory accesses. This func- 
http://developer.nvidia.com/object/cuda_3_2_downloads.Mtffiil:s memory bound, but we reach up to 145 GB/s SUS- 
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Fig. 1. Illustration of the convolution algorithm. To smooth a specific ring of the input map (upper left), we first 
interpolate the kernel on the grid, centered on this particular ring (upper right). After a ringwise Fourier transformation 
of both map and kernel, we multiply the two components (lower left, transformed to pixel space for illustrative purposes). 
Finally, we perform the summation alongside the longitudinal direction and write the inverse Fourier transformed result 
to the output (lower right image). 



tained memory throughput, above 80 % of the theoretically 
achievable peak performance. 

We finally compute an inverse complex-to-real FFT be- 
fore we transfer the data back to host memory. 

In contrast to the exact solution, we find the error in 
the first and last ring to be unexpectedly enhanced. This is 
probably the result of amplified numerical errors. Although 
more of a cosmetic correction than one motivated by accu- 



racy considerations, we recalculate the values of these eight 
pixels via a direct sum. 

With an implementation as described above, a signifi- 
cant speedup compared to a FSHT-based convolution can 
be realized for compact kernels. We show the results of our 
benchmark tests in Figure [5J where we compare the run- 
time of our algorithm for different map resolution param- 
eters and kernel sizes to that of the HEALPix FSHT, and 
to the optimized libpsht library on a single CPU core. We 
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Table 1. Breakdown of the total runtime into the contri- 
butions of the three most important code sections for the 
convolution of a map at n s ;d c = 2048 with kernels of various 
sizes. 



Kernel support 


1° 


4° 


16° 


FFTs 


57% 


47 % 


39 % 


Kernel evaluation 


19 % 


31 % 


39 % 


Ring reduction 


10 % 


16 % 


21 % 


Others 


14% 


6 % 


1 % 



note that the GPU timings include the time for transferring 
the input map from host memory to GPU memory and the 
output map from the GPU back to the host. 

We observe a dependence of the performance gain on 
the resolution parameter n s idc- For comparatively low- 
resolution maps and small-to-moderate kernel sizes, we find 
the poorer numerical efficiency to be the result of too small 
a workload to enable the efficient use of the available GPU 
hardware resources. In addition, we explicitly optimized our 
code for fast convolutions at n s ;dc = 2048. For a kernel sup- 
port of 1°, the runtime is completely dominated by the com- 
putational costs of the FFTs. However, with an increasing 
kernel size, the scaling behavior changes as both the ker- 
nel evaluation and the multiplication of the Fourier coeffi- 
cients of kernel and map become more and more expensive. 
Accordingly, the higher efficiency that we achieve flattens 
out towards higher kernel diameters. We specify the frac- 
tional computational costs of the different code sections for 
the convolution at n s id c = 2048 in Table Q] Since the three 
major parts of the algorithm take up comparable amounts 
of GPU time, further implementation optimizations for the 
GPU are unlikely to result in performance gains signifi- 
cantly larger than the 10 % level. 

In the case of the narrow convolution kernels often en- 
countered during CMB map beam-convolution processes, 
performance improvements of up to two orders of magni- 
tudes can be achieved. For example, smoothing a HEALPix 
map with resolution n s id c = 2048, f max = 4096 us- 
ing a Gaussian kernel of 4.7' full width at half maxi- 
mum (F WHM), a realistic value for the Pla nck 217 GHz 
channel (jPlanck HFI Core Team et al.l 1201 lh takes about 
^ARKCoS = 2.2 s non the GPU, whereas the FSHT-based 
approach requires iHEALPix = 460 s and tnbpsht = 160 s on 
one CPU core. Although the intrinsically parallel structure 
of the algorithm can be most beneficially exploited when 
run on GPUs, a CPU-based implementation may also be 
appropriate for very compact kernels. For the setup dis- 
cussed above, the convolution takes about i^RKCoS = 20 s 
on one single CPU core, which is still considerably faster 
than the FSHT-based code. 

The cost of realizing these performance gains is to add 
a GPU at about half of the cost of the quad-core host 
system. To compare performance per hardware dollar, the 
GPU timings should be compared to half the CPU timings. 



4.3. Accuracy tests 

Our accuracy goal was to achieve a fractional root mean 
square (RMS) accuracy of O(10 -4 ) or lower, which would 
be sufficient for most CMB applications. We assessed the 
accuracy of the newly developed algorithm on the basis of 




Fig. 2. Performance gain of the GPU-based convolution 
code when compared to the HEALPix FSHT-based imple- 
mentation (left axis) and the libpsht library (right axis) 
running on a single CPU core for different map resolution 
parameters and kernel support. 



both the pixel space representation of the convolved maps, 
and their power spectra. 

For the first test, we computed difference maps of the 
output generated by ARKCoS and HEALPix. Using a 
Gaussian kernel with 1° FWHM, as plotted in Figure [3J 
we show the result of the comparison in Figure 2J We note 
that we normalized the difference using the RMS of the ref- 
erence map to obtain a relative percentage error. We find 
the small remaining residual around the polar caps to be 
dominated by outliers produced by the FFT library for spe- 
cific transformation lengthtQ, whereas inaccuracies in the 
kernel evaluation prevail in the equatorial region. Averaged 
over the entire map, ARKCoS reproduces the results from 
the HEALPix package for different kernels with a fractional 
RMS error of at most O(10~ 4 ), which decreases rapidly for 
kernel sizes > 0.5° FWHM. 

As a second test, we compared the power spectrum of 
the convolved map with the theoretical expectation. With 
a FWHM of 6', we chose a very narrow Gaussian kernel 
close to the grid resolution at n s ide = 2048 that is no longer 
band- limited at £ miiX = 4096. The reference power spectrum 
used in this test was calculated exactly from the spherical 
harmonic representations of input map and kernel. For one 
realization, the result is shown in the left-hand panel of 
Figure [5l and interpreted in terms of the cross-power spec- 
trum between the map and the induced error. 

In addition, we show the power spectrum of the dif- 
ference map, where we again compare the output of our 
algorithm to the exact solution. Here, the reference map 
was derived via Equation ([5]) from the spherical harmonic 
coefficients of input map and kernel. For one realization, 



Tests with the identical code linked to the more recent 
CUFFT library version 3.2 made these outliers disappear, but 
performance suffered at the 15 % level. 
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we show the result in the right-hand panel of Figure [5j it 
represents the auto power spectrum of the error. In both 
cases, we find the error in the power spectra to be subdomi- 
nant over the full dynamical range of about 14 magnitudes^!, 
showing that the algorithm does not introduce a significant 
level of artificial mode coupling. 

We conclude with the remark that highly compact ker- 
nels of scale-lengths smaller than w 10' FWHM, a regime 
of particular relevance to beam convolution at the resolu- 
tion of the Planck high frequency instrument, will suffer 
from truncation errors if a band-limit of £ max = 4096 is im- 
posed. As a result, the back-transformed pixel space rep- 
resentation starts to show ringing artifacts. In contrast to 
FSHT-based algorithms, it is possible to suppress this ef- 
fect in our convolution scheme. The first modification of the 
algorithm concerns the treatment of super-Nyquist modes 
in the polar caps. These modes are available to us because 
we supersample the kernel in direct space on 4 n s idc points 
on all rings. After performing the forward Fourier trans- 
form of the input map, we now duplicate the coefficients to 
obtain a fully populated rectangular grid with 2 n s id c + 1 
elements on all rings. Likewise, we add the super-Nyquist 
modes to the sub-Nyquist modes prior to calculate the in- 
verse Fourier transform of the output map. This (optional) 
step in the algorithm adds a factor of less than two to the 
computational time. In the equatorial region, the error can 
be removed completely if we slightly alter the algorithm on 
every unshifted ring. Here, we start the convolution using 
the unmodified Fourier transform of the input map, that is, 
we do not apply Equation ©. We instead take into account 
the offset on every other ring during the kernel evaluation, 
i.e., we substitute cos((/>) with cos(0 — </>o) in Equation ([9]). 
In Figure [51 we compare the output of our modified algo- 
rithm to that of a FSHT-based scheme for the convolution 
of several point sources with a Gaussian beam of width 
4.7' FWHM at n sido = 2048. The conventional approach 
suffers from spurious ringing effects that extend well be- 
yond the formal support of the kernel. Using ARKCoS, the 
artifacts are completely absent in the equatorial region, and 
suppressed and confined to the latitudinal direction in the 
polar caps. We note that on the more regular ECP-grid, the 
ringing pattern would vanish exactly on the entire sphere 
without the need to modify the algorithm. 



5. Discussion and conclusion 

We have presented an implementation of a GPU- 
accelerated hybrid algorithm for radial kernel convolution 
on the sphere. It performs the convolution along isolatitude 
rings in Fourier space and integrates in longitudinal direc- 
tion in pixel space. We call this algorithm ARKCoS. As the 
computational costs scale linearly with the kernel support, 
the method is most beneficial for convolution with compact 
kernels. Typical applications include CMB beam smooth- 
ing, symmetric wavelet analyses, and point-source filtering 
operations. 

For a convolution with compact kernels, we find that 
our implementation realizes real performance gains of up 
to 5000 %, depending on the problem size, for a 50 % in- 
crease in system cost relative to the most widely used FSHT 



implementation in the HEALPix library running in parallel 
on a quad-core CPU. When compared to the more finely 
tuned libpsht FSHT library, again running on four cores, 
we still find significant performance gains, up to 1800 %. 

We assessed the numerical accuracy of the algorithm by 
comparing the convolved output map to the result gener- 
ated using HEALPix. The outcome typically agrees with 
the FSHT-based convolution to 1 part in 10 4 . Comparing 
the power spectrum of the output map to the exact solution 
for a narrow convolution kernel, we find a relative error of 
smaller than 10 -3 . For kernels that are not band- limited, 
the convolution with a FSHT-based scheme induces ringing 
artifacts. Using instead a slightly modified implementation 
of ARKCoS, however, we have demonstrated that a huge 
reduction in the spurious contribution is possible. 

The massively parallel hybrid approach we have pre- 
sented here is particularly advantageous for convolutions 
with compact kernels (with support less than ~ 15°) at 
high resolution (n S jd c = 512 or higher). The GPU we used 
for our tests has 1.5 GB of RAM. This is too small to store 
simultaneously the input and output HEALPix maps at 
resolution n s ;d c = 4096 or higher. Possible solutions for fu- 
ture work involve calculating the contribution to the output 
map from subsets of rings in the input map, cither sequen- 
tially or in parallel if more than one GPU is available in 
the same system. 

This work deals with radial kernel convolution. 
We note in closing that there is considerable in- 
terest in the algorithmically more difficult problem 



of a s ymmetric ke r nel c onvolution (IWandelt fc Gorski 



2001; IWiaux et al l 120051: ASYMFA S T. iTristram et al 
P0l FICS Bell. iHivo n fc Ponthieul [20111; FEBECoP : 



8 Note that in Figure [5] we show the power spectra multiplied 
by I (£ + l)/2n. This factor has to be taken into account when 
the dynamical range of the simulation is to be assessed. 



iMitra et al.l 1201 lj ) either to model the physical optics of 
CMB experiments more fait hfully (|Mennella et al.l 120111: 
iPlanck HFI Core Team et"aTl 1201 ll ) or to detect signals 
that have locally anisotropic signatures. Having found in 
this work that our hybrid algorithm vastly accelerates ra- 
dial kernel convolution, it is easy to imagine generalizations 
that accelerate asymmetric kernel c onvolution in a simi- 
lar way. The ASYMFAST approach (|Tristram et al.ll2004j) 
reduces the problem of asymmetric beam convolution to 
O(10) symmetric convolutions. Coupled to our GPU accel- 
erated approach, the convolution with even complex asym- 
metric kernels and compact support takes less time than 
the convolution with a single symmetric kernel on a CPU 
system using fast spherical harmonic transforms. 
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Fig. 4. Result of the accuracy test. The difference between a map smoothed in spherical harmonic space and a map 
smoothed with our hybrid method is at most 1.5 x 10~ 4 in a small number of outliers around the north pole (left panel). 
These are generated by numerical errors in the CUFFT library v. 3.1 for specific HEALPix ring lengths and are absent 
when using the slightly slower CUFFT v. 3.2. The larger scale 0(1O~ 5 ) error both at the pole and in the equatorial region 
(right panel) is caused by small inaccuracies in the kernel evaluation. In this test, a Gaussian kernel with 1° FWHM was 
used. Each patch is 10° on the side. 
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Fig. 5. Power spectrum accuracy. Left panel: for one particular realization, we plot the difference between the power 
spectrum of a map convolved with a narrow Gaussian, FWHM=6' and the power spectrum of the exact convolution. 
Right panel: we show the power spectrum of the difference map, computed from the convolution of a map using ARKCoS 
and the exact solution. For comparison, we also show the expected power spectrum of the exact convolution in both 
panels (dashed lines). Note that since we are comparing to the exact convolved power spectrum this error measure 
includes the HEALPix quadrature error. 
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Fig. 6. Reduced ringing-artifacts in our enhanced hybrid algorithm. Left panel: the convolution of point sources with a 
Gaussian kernel of 4.7' FWHM using a FSHT-based algorithm at n si d c = 2048 causes extended residuals. Right panel: 
the result obtained with ARKCoS only shows a suppressed ringing pattern in the longitudinal direction in the polar 
caps (upper four point sources) . In the equatorial region, the artifacts cancel out exactly (lower four point sources) . Each 
patch is 13° on the side, the logarithmic color scale counts representing factors of 10 from the maximum. For ARKCoS, 
the ringing patterns are too small to be visible on a linear scale. 



